Ranked Queries in Index Data Structures
Authors
Abstract
A ranked query returns the top-ranking elements of a set, sorted by rank, where the rank is given by a preference function defined on the items of the set. This thesis investigates how to add ranked-query capabilities to several index data structures on top of their existing functionality. The data structures investigated include suffix trees, range trees, and hierarchical data structures. We explore the problem of additionally specifying rank when querying these data structures. In the case of suffix trees, for example, we would like to obtain not all occurrences of a substring in a text or in a set of documents, but only the most preferable ones. Most importantly, the cost of such a query must be proportional to the number of preferable results rather than to the number of all occurrences, which can be too many to process efficiently.
First, we introduce the concept of rank-sensitive data structures, defined by analogy with output-sensitive data structures. Output-sensitive data structures report the items satisfying an online query in time linear in the number of items returned plus a sublinear function of the number of items stored. Rank-sensitive data structures are additionally given a ranking of the items, and only the top k best-ranking items are reported at query time, sorted in rank order. The query time must remain linear in the number of items returned, which is now determined not by the number of elements satisfying the query but by the parameter k given at query time. We explore several ways of adding rank-sensitivity to different data structures and the trade-offs that each incurs.
Adding rank to an index query can be viewed as adding a dimension to the indexed data set. Ranked queries can therefore be viewed as multidimensional range queries, with the notable difference that we also want to maintain an ordering along one of the dimensions. Most range data structures do not maintain such an order, with the exception of the Cartesian tree. The Cartesian tree has multiple applications in range searching and other fields, but it is rarely used for indexing because its rigid structure makes it difficult to use with dynamic content. The second part of this work deals with overcoming this rigidity and describes the first efficient dynamic version of the Cartesian tree.
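As an illustration of the idea of reporting only the k best-ranked items in a range, the following is a minimal Python sketch built on a static Cartesian tree. It is not the construction from the thesis: the function names (`build_cartesian_tree`, `top_k_in_range`), the use of numeric scores where a higher score means a better rank, and the walk-from-the-root search are all assumptions made for this example. Because the search descends the tree, its cost depends on the tree height, so this sketch does not by itself achieve the rank-sensitive bound described above; it only shows how heap order along one dimension lets a query stop after k results.

```python
import heapq


def build_cartesian_tree(scores):
    """Build a max-Cartesian tree over `scores` in O(n) with a stack.

    Returns (root, left, right), where left[i] / right[i] are the child
    indices of position i, or -1 if there is no child.
    """
    n = len(scores)
    left = [-1] * n
    right = [-1] * n
    stack = []  # indices on the rightmost path, scores non-increasing
    for i in range(n):
        last = -1
        while stack and scores[stack[-1]] < scores[i]:
            last = stack.pop()
        left[i] = last              # popped chain becomes the left subtree
        if stack:
            right[stack[-1]] = i    # i hangs to the right of the new top
        stack.append(i)
    root = stack[0] if stack else -1
    return root, left, right


def _best_in_range(node, lo, hi, left, right):
    """Descend from `node` to the first position inside [lo, hi].

    In a max-Cartesian tree that position holds the best score of the range.
    """
    while node != -1:
        if node < lo:
            node = right[node]
        elif node > hi:
            node = left[node]
        else:
            return node
    return -1


def top_k_in_range(scores, tree, lo, hi, k):
    """Return up to k (position, score) pairs from [lo, hi], best first."""
    root, left, right = tree
    heap = []  # entries: (-score, position, range_lo, range_hi)

    def push(start, a, b):
        if a <= b:
            node = _best_in_range(start, a, b, left, right)
            if node != -1:
                heapq.heappush(heap, (-scores[node], node, a, b))

    push(root, lo, hi)
    result = []
    while heap and len(result) < k:
        neg_score, node, a, b = heapq.heappop(heap)
        result.append((node, -neg_score))
        # The next-best candidates are the maxima of the two sub-ranges.
        push(left[node], a, node - 1)
        push(right[node], node + 1, b)
    return result


scores = [3, 9, 2, 7, 5, 8, 1]
tree = build_cartesian_tree(scores)
print(top_k_in_range(scores, tree, 2, 6, 3))  # [(5, 8), (3, 7), (4, 5)]
```

The query touches only the nodes it reports plus their immediate candidates, so the work grows with k rather than with the number of positions in the range, apart from the tree descents noted above.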
Similar resources
Ranked Document Retrieval in (Almost) No Space
Ranked document retrieval is a fundamental task in search engines. Such queries are solved with inverted indexes that require additional 45%-80% of the compressed text space, and take tens to hundreds of microseconds per query. In this paper we show how ranked document retrieval queries can be solved within tens of milliseconds using essentially no extra space over an in-memory compressed repre...
Improving the View Selection Algorithm in Data Warehouses by Finding Frequent Queries
A data warehouse is a store of historical data that supports decision making. Analytic queries usually take a long time. To reduce response time, some views should be materialized so that all queries can be answered in minimum response time. There are many solutions to the view selection problem. The most appropriate one is to materialize frequent queries. Previously posed ...
Term-Frequency Surrogates in Text Similarity Computations
Inverted indexes on external storage perform best when accesses are ordered and data is read sequentially, so that seek times are minimized. As a consequence, the various items required to compute Boolean, ranked and phrase queries are often interleaved in the inverted lists. While suitable for query types in which all items are required, this arrangement has the drawback that other query types...
Top-k Ranked Document Search in General Text Databases
Text search engines return a set of k documents ranked by similarity to a query. Typically, documents and queries are drawn from natural language text, which can readily be partitioned into words, allowing optimizations of data structures and algorithms for ranking. However, in many new search domains (DNA, multimedia, OCR texts, Far East languages) there is often no obvious definition of words...
Fuzzy retrieval of encrypted data by multi-purpose data-structures
The growing amount of information that has arisen from emerging technologies has caused organizations to face challenges in maintaining and managing their information. Expanding hardware and human resources, and outsourcing data management and maintenance to an external organization in the form of cloud storage services, are two common approaches to overcoming these challenges. The first approach costs of...